With this worksheet, you will learn the first steps with Jupyter, Python, pandas and matplotlib using a practical example: We execute an analysis of the futuristic data set "curse log" – a log of time-tracked swearwords events gathered with the microchip implant "FUTURE 2000" from several humans.
We managed to get the log of uttered curses of some users. We also have some user profile data that we can combine with the curse log to find out, who the cursers are.
As a starting point, we have a log from several users recorded in a file that lists the time stamp, the curse word and the id of the user's profile each curse:
timestamp,curse,profile_id
2132-12-31 14:47:43,The A-word,0
2132-12-31 13:13:56,The F-word,0
2132-12-31 13:03:05,The S-word,0
2132-12-31 12:30:34,The F-word,0
2132-12-31 12:29:02,The S-word,0
Hint: This data is a data set from the future. But since time machines aren't invented yet, we are working with a generated / synthetic dataset based on a real data set from an other domain.
Let's get to know the tools we use!
In [1]:
"Hello World"
Out[1]:
ESC
key.b
key. m
.Enter
(note the color to the left of the cell, which turns green instead of blue).Ctrl
+ Enter
.This is a text
In [2]:
text = "Hello World"
text
Out[2]:
In [3]:
text[0]
Out[3]:
In [4]:
text[-1]
Out[4]:
In [5]:
text[2:4]
Out[5]:
In [6]:
text.upper()
Out[6]:
split
function of text
.Shift
+Tab
.Shift
+Tab
twice in quick succession.Shift
+Tab
four times in quick succession (and then ESC
to hide) text
with split
exactly once (parameter maxsplit
) apart by using the l
("L") as separator (parameter sep
).
In [7]:
text.split("l",maxsplit=1)
Out[7]:
In [8]:
import pandas as pd
pd?
In [9]:
log = pd.read_csv("../dataset/curse_log.gz")
log.head()
Out[9]:
In [10]:
log.info()
We see that log
is
timestamp
, curse
and profile_id
curse
in log
with the method value_counts()
.top_curses
.top_curses
.Note: In this tutorial, we access Series directly with the .<Series>
notation (e. g. log.curse
). This works only if the names of the Series are different from the provided functions of a Series. E. g. it doesn't work, when you try to access a Series named count
, because count()
is a function of a Series. Here, you have to use the ['<Series name>']
notation (e.g. log['count']
. When in doubt, always use the ['<Series name>']
notation (but which disables the auto-completion feature)
In [11]:
top_curses = log.curse.value_counts()
top_curses
Out[11]:
In [12]:
%matplotlib inline
top_curses.plot()
Out[12]:
In [13]:
top_curses.plot.bar()
Out[13]:
In [14]:
top_curses.plot.bar();
In [15]:
top_curses.plot.pie();
In [16]:
top_curses.plot(
kind='pie',
title="Top curses",
label="",
figsize=[5,5]);
In [17]:
log.timestamp.head()
Out[17]:
In [18]:
ts = pd.to_datetime(log.timestamp)
ts.head()
Out[18]:
In [19]:
log['timestamp'] = ts
log.head()
Out[19]:
In [20]:
log['hour'] = log.timestamp.dt.hour
log.head()
Out[20]:
In [21]:
curses_per_hour = log.hour.value_counts(sort=False)
curses_per_hour.head()
Out[21]:
In [22]:
curses_per_hour.plot.bar();
Now it's time to find out, which users are cursing the most. We have another data set in the Excel file profiles.xlsx
with the following content:
The columns contain this information for all FUTURE 2000 users. It includes the unique identification number of a user's profile (matches profile_id
in the curse log) as well as the name, birth date, sex and current job.
We combine this data with our log
DataFrame to check off the remaining items on our to-do list:
Find out
In [23]:
profiles = pd.read_excel("../dataset/profiles.xlsx")
profiles.head()
Out[23]:
join()
on the log
DataFrameprofiles
on='profile_id'
to join log
's profile_id
column with the id (=index) column of the profiles
data set.curse_profiles
.curse_profiles
.
In [24]:
curse_profiles = log.join(profiles, on='profile_id')
curse_profiles.head()
Out[24]:
In [25]:
curse_profiles.name.value_counts().head(10)
Out[25]:
'M'
) with the selector notation <DataFrame>[<DataFrame>.<Series> == <value>]
sex
.sex
of the DataFrame curse_profiles
.
In [26]:
curse_profiles[curse_profiles.sex == 'M'].sex.count() / curse_profiles.sex.count()
Out[26]:
curse_profiles
' data along job
and curse
by using groupby
and the list ['job', 'curse']
as argument.sex
(or any other left Series).job_curses
.
In [27]:
job_curses = curse_profiles.groupby(['job', 'curse']).sex.count()
job_curses.head(10)
Out[27]:
In [28]:
max_per_job = job_curses.groupby('job').transform('max')
max_per_job.head()
Out[28]:
In [29]:
favorite_curses_per_job = job_curses[job_curses == max_per_job]
favorite_curses_per_job.head()
Out[29]:
In [30]:
favorite_curses_per_job.reset_index().curse.value_counts()
Out[30]:
In [31]:
favorite_curse_words = favorite_curses_per_job.unstack()
favorite_curse_words.head()
Out[31]:
In [32]:
favorite_curse_words[~favorite_curse_words['The S-word'].isnull()]
Out[32]:
You have now learned some basics about pandas. This will get us a long way in our daily work. The other important topics that are still missing are:
merge
pivot_table
. I hope that this mini-tutorial will show you the potential of data analysis using Jupyter, Python, pandas and matplotlib!
I am looking forward to your comments and feedback!
Markus Harrer
Blog: https://www.feststelltaste.de
Mail: talk@markusharrer.de
Twitter: @feststelltaste
Consulting and training: http://markusharrer.de